feat: Expand benchmark, update params #24

Merged
Pringled merged 20 commits into main from expand-benchmark on Apr 17, 2026

@Pringled Pringled commented Apr 17, 2026

This PR expands the benchmark from 29 repos (12 languages) to 66 repos (20 languages), for a total of 1318 queries. The main metric also changes to the mean of per-language means, which gives a more balanced view of how well semble works across languages.

As a result, the old benchmark scores are no longer valid. A few parameters were also tuned.

Current dataset overview:

| Language | Repos | Projects |
|---|---|---|
| bash | 3 | bash-it, bats-core, nvm |
| c | 3 | curl, libuv, redis |
| cpp | 3 | abseil-cpp, fmtlib, nlohmann-json |
| csharp | 3 | dapper, messagepack-csharp, newtonsoft-json |
| dart | 3 | dio, http-dart, riverpod |
| elixir | 3 | ecto, phoenix, plug |
| go | 3 | chi, cobra, gin |
| haskell | 3 | aeson, pandoc, xmonad |
| java | 3 | commons-lang, gson, jackson-databind |
| javascript | 3 | axios, express, redux |
| kotlin | 3 | exposed, kotlinx-coroutines, ktor |
| lua | 3 | lazy.nvim, mini.nvim, telescope.nvim |
| php | 3 | guzzle, laravel-framework, monolog |
| python | 9 | aiohttp, click, fastapi, flask, httpx, model2vec, pydantic, requests, starlette |
| ruby | 3 | rack, rails, sinatra |
| rust | 3 | axum, serde, tokio |
| scala | 3 | cats, circe, http4s |
| swift | 3 | alamofire, rxswift, vapor |
| typescript | 3 | trpc, vitest, zod |
| zig | 3 | zig, zig-clap, zls |

Pringled added 20 commits April 16, 2026 12:25
Remove tasks where target files moved to external packages in newer
versions (express v5 router/middleware, chi cors/redirect, ecto
migration, phoenix template, rack session). Fix paths for
jackson-databind BeanDeserializer, kotlinx-coroutines CoroutineContext,
nlohmann-json json_pointer, and circe DecodingFailure.
- NL alpha 0.6 -> 0.5: equal weight semantic + BM25 (BM25 finds targets
  2.3x more often than semantic among failure queries)
- Stem boost multiplier 0.5 -> 1.0: stronger file-path keyword signal
- Match ratio threshold 0.20 -> 0.10: boost files when any keyword
  matches, even for longer queries

NDCG@10 on 50-repo benchmark: 0.838 -> 0.851 (+0.013)
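
To make the three tuned parameters above concrete, here is a minimal sketch of how such knobs could interact. It is not semble's implementation: the function names, the assumption that alpha weights the semantic side, the assumption that both scores are already normalized to [0, 1], and the token-based stem matching are all hypothetical choices made for illustration.

```python
import re


def blend_scores(semantic: float, bm25: float, alpha: float = 0.5) -> float:
    """Linearly blend two normalized scores.

    Assumption: alpha is the weight on the semantic score, so
    alpha = 0.5 gives semantic and BM25 equal weight.
    """
    return alpha * semantic + (1.0 - alpha) * bm25


def stem_boost(query_keywords, file_stem, multiplier=1.0, min_match_ratio=0.10):
    """Hypothetical file-path boost based on keyword overlap with the file stem.

    The boost is zero unless the fraction of query keywords found in the
    stem reaches min_match_ratio; otherwise it scales with that fraction.
    """
    keywords = [kw.lower() for kw in query_keywords]
    if not keywords:
        return 0.0
    # Split the stem on non-alphanumeric characters: "json_pointer" -> {"json", "pointer"}
    stem_tokens = set(re.split(r"[^a-z0-9]+", file_stem.lower()))
    matches = sum(1 for kw in keywords if kw in stem_tokens)
    ratio = matches / len(keywords)
    if matches == 0 or ratio < min_match_ratio:
        return 0.0
    return multiplier * ratio


# A ten-keyword query with a single stem match has a match ratio of 0.10.
long_query = "how does the router match incoming request paths to handlers".split()
new_boost = stem_boost(long_query, "router", multiplier=1.0, min_match_ratio=0.10)
old_boost = stem_boost(long_query, "router", multiplier=0.5, min_match_ratio=0.20)
```

In this sketch the single-match query is boosted under the 0.10 threshold but not under 0.20, which is one plausible reading of "boost files when any keyword matches, even for longer queries".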
Add semantic/architecture/symbol categories to 212 tasks across 14 repos
that were missing them. Add 11 new express tasks to restore coverage
after broken annotations were removed (9 -> 20 tasks).

Total: 930 tasks across 48 repos, all categorized.
- commons-lang: reflectionEquals span 89-99 -> 179-318 (class header
  is not the reflection logic)
- circe: auto/semiauto derivation target was Decoder.scala (wrong file),
  now points to generic/auto.scala + semiauto.scala
- exposed: SchemaUtils target was abstract SchemaUtilityApi.kt, now
  points to the concrete SchemaUtils.kt in exposed-jdbc
- sinatra: halt/pass/redirect span too narrow, use whole-file
- sinatra: Rack build() method span was setup_default_middleware helper,
  now points to the actual build() method at line 1670
- sinatra: Helpers symbol span extended to cover halt (1028) and pass (1036)
guzzle +5, ktor +4, sinatra +4, messagepack-csharp +3, alamofire +3,
tokio +3, trpc +3, cats +3. All repos now have >= 20 tasks.
Total: 954 tasks across 48 repos.
- Add curl, redis, bats-core, aeson, http-dart, telescope.nvim, lazy.nvim, zig
- 160 new annotation tasks (20 per repo)
- Add .bash, .zig, .hs file extensions to file_walker
- Overall NDCG@10: 0.841 across 56 repos
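
As an illustration of the file_walker change above, the sketch below shows one way an extension-to-language lookup could gate which files get indexed. The mapping name, the function name, and every entry other than `.bash`, `.zig`, and `.hs` are assumptions; the real file_walker may be structured differently.

```python
from pathlib import Path

# Hypothetical mapping. Only .bash, .zig and .hs are taken from the commit
# message; the .py entry is there so the example has a pre-existing language.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".bash": "bash",
    ".zig": "zig",
    ".hs": "haskell",
}


def walk_source_files(root):
    """Yield (path, language) for every file under root with a known extension."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        language = EXTENSION_TO_LANGUAGE.get(path.suffix.lower())
        if language is not None:
            yield path, language
```

Files with an unlisted extension are silently skipped here, which is why a repo in a new language would contribute nothing to the index until its extension is registered.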
…mean-of-language-means

- Add 10 new repos: nvm, bash-it (replaces gitflow-avh), pandoc, xmonad,
  dio, riverpod, nvim-lspconfig, mini.nvim, zls, zig-clap
- Bring bash, haskell, dart, lua, zig all to 3+ repos
- Fix run_benchmark.py aggregation: headline NDCG@10 is now the mean of
  per-language means (one vote per language, not per repo); the previous
  per-repo mean over-weighted Python's 9 repos
- Fix numpy float type annotation issue (float() cast on np.median)
- New headline: NDCG@10 = 0.829 across 20 languages (66 repos)
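
The aggregation change can be sketched as follows. This is a minimal illustration, not the code in run_benchmark.py: the function names and the input shape (one `(language, score)` pair per repo) are assumptions, and the scores in the example are made up to show the effect.

```python
from collections import defaultdict
from statistics import mean


def per_repo_mean(repo_scores):
    """Old-style headline: every repo gets one vote."""
    return mean(score for _, score in repo_scores)


def mean_of_language_means(repo_scores):
    """New-style headline: average within each language, then across languages."""
    by_language = defaultdict(list)
    for language, score in repo_scores:
        by_language[language].append(score)
    return mean(mean(scores) for scores in by_language.values())


# Made-up scores: three strong Python repos and one weaker Zig repo.
example = [("python", 0.9), ("python", 0.9), ("python", 0.9), ("zig", 0.5)]
old_headline = per_repo_mean(example)           # about 0.8, dominated by Python
new_headline = mean_of_language_means(example)  # about 0.7, one vote per language
```

With one vote per language, a language with many repos can no longer pull the headline number toward its own score.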
… annotation audit

- Fix n_relevant to use annotation count instead of index coverage (reviewer #5)
- Add per-category NDCG@10 to printed summary and saved JSON (reviewer #7)
- Replace 11 trivially-lexical semantic queries with vocabulary-diverse alternatives
- Baseline: NDCG@10 = 0.825 (architecture=0.773, semantic=0.823, symbol=0.943)
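
For reference, a minimal sketch of NDCG@10 with binary relevance, where the ideal ranking is sized from the annotated relevant items as described in the n_relevant fix above. The benchmark's actual grading scheme and matching granularity (file or span) are not shown in this PR, so the function name, the binary gains, and the file-level matching are assumptions.

```python
import math


def ndcg_at_k(ranked_files, relevant_files, k=10):
    """NDCG@k with binary relevance; assumes ranked_files has no duplicates."""
    relevant = set(relevant_files)
    if not relevant:
        return 0.0
    # Discounted gain of each relevant file that appears in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, path in enumerate(ranked_files[:k])
        if path in relevant
    )
    # n_relevant comes from the annotations, not from what the index contains,
    # so a relevant file that was never indexed still lowers the score.
    n_relevant = len(relevant)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(n_relevant, k)))
    return dcg / idcg
```

If n_relevant were instead counted from the relevant files present in the index, a query whose second target was never indexed would be scored against an ideal of one hit and could reach 1.0 despite the miss.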
…ent-scoped one

- ktor: the server application query targeted files outside the benchmark_root;
  replaced with a client-side plugin pipeline query that indexes correctly
- rxswift: Observable.swift is a thin declaration file; corrected relevant target
  to ObservableType.swift which contains the actual protocol definition
- Swift +0.006, Kotlin +0.004, architecture category +0.002
- sinatra: fix 3 queries pointing to wrong/narrow line ranges in base.rb
- circe: replace out-of-scope generic derivation query (targets modules/generic/
  which is outside benchmark_root) with DecodingFailure/ParsingFailure query
  targeting Error.scala in core
- cats: replace Semigroup/Monoid query pointing to kernel/ module (outside root)
  with MonoidK/SemigroupK query targeting core
- rxswift: add Zip+arity.swift as a second relevant target for the zip operator query
- exposed: add Transactions.kt as a second relevant target for the transaction block query

NDCG@10: 0.825 (baseline) -> 0.830
Remove outdated result files from previous benchmark runs and add
fresh result from current HEAD (NDCG@10=0.830).
- Remove nvim-lspconfig (4th lua repo, lowest score 0.583) to keep
  all languages at 3 repos
- Fix bash-it and libuv annotations using non-standard 'api' and
  'keyword' categories; remap to 'architecture' and 'symbol'
- Refresh benchmark results: NDCG@10 = 0.833
@Pringled Pringled merged commit 256b839 into main Apr 17, 2026
8 checks passed
@Pringled Pringled deleted the expand-benchmark branch April 22, 2026 05:05